{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "amazing-porter",
   "metadata": {},
   "source": [
    "# Norwegian Birth Numbers"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "color-blond",
   "metadata": {},
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "current-soundtrack",
   "metadata": {},
   "source": [
    "The function `clean_no_fodselsnummer()` cleans a column containing Norwegian birth number (fodselsnummer) strings, and standardizes them in a given format. The function `validate_no_fodselsnummer()` validates either a single fodselsnummer strings, a column of fodselsnummer strings or a DataFrame of fodselsnummer strings, returning `True` if the value is valid, and `False` otherwise."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "inclusive-difficulty",
   "metadata": {},
   "source": [
    "fodselsnummer strings can be converted to the following formats via the `output_format` parameter:\n",
    "\n",
    "* `compact`: only number strings without any seperators or whitespace, like \"15108695088\"\n",
    "* `standard`: fodselsnummer strings with proper whitespace in the proper places, like \"151086 95088\"\n",
    "* `birthdate`: get the person's birthdate, like \"1986-10-15\".\n",
    "* `gender`: get the person's birth gender ('M' or 'F').\n",
    "\n",
    "Invalid parsing is handled with the `errors` parameter:\n",
    "\n",
    "* `coerce` (default): invalid parsing will be set to NaN\n",
    "* `ignore`: invalid parsing will return the input\n",
    "* `raise`: invalid parsing will raise an exception\n",
    "\n",
    "The following sections demonstrate the functionality of `clean_no_fodselsnummer()` and `validate_no_fodselsnummer()`. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "hungry-secret",
   "metadata": {},
   "source": [
    "### An example dataset containing fodselsnummer strings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "functional-columbus",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"fodselsnummer\": [\n",
    "            '15108695088',\n",
    "            '15108695077',\n",
    "            \"999 999 999\",\n",
    "            \"004085616\",\n",
    "            \"002 724 334\",\n",
    "            \"hello\",\n",
    "            np.nan,\n",
    "            \"NULL\",\n",
    "        ], \n",
    "        \"address\": [\n",
    "            \"123 Pine Ave.\",\n",
    "            \"main st\",\n",
    "            \"1234 west main heights 57033\",\n",
    "            \"apt 1 789 s maple rd manhattan\",\n",
    "            \"robie house, 789 north main street\",\n",
    "            \"1111 S Figueroa St, Los Angeles, CA 90015\",\n",
    "            \"(staples center) 1111 S Figueroa St, Los Angeles\",\n",
    "            \"hello\",\n",
    "        ]\n",
    "    }\n",
    ")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "hawaiian-nickname",
   "metadata": {},
   "source": [
    "## 1. Default `clean_no_fodselsnummer`\n",
    "\n",
    "By default, `clean_no_fodselsnummer` will clean fodselsnummer strings and output them in the standard format with proper separators."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "brave-title",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.clean import clean_no_fodselsnummer\n",
    "clean_no_fodselsnummer(df, column = \"fodselsnummer\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "external-alexander",
   "metadata": {},
   "source": [
    "## 2. Output formats"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "prime-evidence",
   "metadata": {},
   "source": [
    "This section demonstrates the output parameter."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "individual-robin",
   "metadata": {},
   "source": [
    "### `standard` (default)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "retired-participation",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_no_fodselsnummer(df, column = \"fodselsnummer\", output_format=\"standard\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "instant-broadcasting",
   "metadata": {},
   "source": [
    "### `compact`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "nutritional-verse",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_no_fodselsnummer(df, column = \"fodselsnummer\", output_format=\"compact\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "iraqi-copper",
   "metadata": {},
   "source": [
    "### `birthdate`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "considerable-roulette",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_no_fodselsnummer(df, column = \"fodselsnummer\", output_format=\"birthdate\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "breathing-lease",
   "metadata": {},
   "source": [
    "### `gender`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "blind-output",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_no_fodselsnummer(df, column = \"fodselsnummer\", output_format=\"gender\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "compound-indonesian",
   "metadata": {},
   "source": [
    "## 3. `inplace` parameter\n",
    "\n",
    "This deletes the given column from the returned DataFrame. \n",
    "A new column containing cleaned fodselsnummer strings is added with a title in the format `\"{original title}_clean\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "quality-september",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_no_fodselsnummer(df, column=\"fodselsnummer\", inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "informational-bennett",
   "metadata": {},
   "source": [
    "## 4. `errors` parameter"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "premier-merchant",
   "metadata": {},
   "source": [
    "### `coerce` (default)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "impressed-description",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_no_fodselsnummer(df, \"fodselsnummer\", errors=\"coerce\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "intended-launch",
   "metadata": {},
   "source": [
    "### `ignore`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "specific-baltimore",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_no_fodselsnummer(df, \"fodselsnummer\", errors=\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "inappropriate-harmony",
   "metadata": {},
   "source": [
    "## 4. `validate_no_fodselsnummer()`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "solar-prototype",
   "metadata": {},
   "source": [
    "`validate_no_fodselsnummer()` returns `True` when the input is a valid fodselsnummer. Otherwise it returns `False`.\n",
    "\n",
    "The input of `validate_no_fodselsnummer()` can be a string, a Pandas DataSeries, a Dask DataSeries, a Pandas DataFrame and a dask DataFrame.\n",
    "\n",
    "When the input is a string, a Pandas DataSeries or a Dask DataSeries, user doesn't need to specify a column name to be validated. \n",
    "\n",
    "When the input is a Pandas DataFrame or a dask DataFrame, user can both specify or not specify a column name to be validated. If user specify the column name, `validate_no_fodselsnummer()` only returns the validation result for the specified column. If user doesn't specify the column name, `validate_no_fodselsnummer()` returns the validation result for the whole DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "choice-student",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.clean import validate_no_fodselsnummer\n",
    "print(validate_no_fodselsnummer(\"15108695088\"))\n",
    "print(validate_no_fodselsnummer(\"15108695077\"))\n",
    "print(validate_no_fodselsnummer(\"999 999 999\"))\n",
    "print(validate_no_fodselsnummer(\"51824753556\"))\n",
    "print(validate_no_fodselsnummer(\"004085616\"))\n",
    "print(validate_no_fodselsnummer(\"hello\"))\n",
    "print(validate_no_fodselsnummer(np.nan))\n",
    "print(validate_no_fodselsnummer(\"NULL\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "competent-thumb",
   "metadata": {},
   "source": [
    "### Series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "isolated-spokesman",
   "metadata": {},
   "outputs": [],
   "source": [
    "validate_no_fodselsnummer(df[\"fodselsnummer\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "numeric-lying",
   "metadata": {},
   "source": [
    "### DataFrame + Specify Column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "equal-joining",
   "metadata": {},
   "outputs": [],
   "source": [
    "validate_no_fodselsnummer(df, column=\"fodselsnummer\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "complimentary-palestinian",
   "metadata": {},
   "source": [
    "### Only DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "authentic-sunday",
   "metadata": {},
   "outputs": [],
   "source": [
    "validate_no_fodselsnummer(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "apparent-strip",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
